03 Alignment and Preprocessing



Once the data is available as detailed in Data Ingestion, the next step is to ensure that it has been appropriately reshaped and aligned across data sources for consumption by the machine learning pipeline (or by other analysis or simulation steps).

We'll be combining data across several years, using the same Landsat images as the Walker Lake example. See that topic for more detail on calculating the change in water levels over time.

In [1]:
import intake
import numpy as np
import xarray as xr

import holoviews as hv

import cartopy.crs as ccrs
import geoviews as gv

import hvplot.xarray

hv.extension('bokeh', width=80)

Recap: Loading data

In [2]:
cat = intake.open_catalog('../catalog.yml')
l5_da = cat.l5().read_chunked()
l5_da
Out[2]:
<xarray.DataArray (band: 6, y: 7241, x: 7961)>
dask.array<shape=(6, 7241, 7961), dtype=int16, chunksize=(1, 256, 256)>
Coordinates:
  * y        (y) float64 4.414e+06 4.414e+06 4.414e+06 ... 4.197e+06 4.197e+06
  * x        (x) float64 2.424e+05 2.424e+05 2.425e+05 ... 4.812e+05 4.812e+05
  * band     (band) int64 1 2 3 4 5 7
Attributes:
    transform:   (30.0, 0.0, 242385.0, 0.0, -30.0, 4414215.0)
    crs:         +init=epsg:32611
    res:         (30.0, 30.0)
    is_tiled:    0
    nodatavals:  (-9999.0,)
In [3]:
l8_da = cat.l8().read_chunked()
l8_da
Out[3]:
<xarray.DataArray (band: 7, y: 7941, x: 7821)>
dask.array<shape=(7, 7941, 7821), dtype=int16, chunksize=(1, 256, 256)>
Coordinates:
  * y        (y) float64 4.426e+06 4.426e+06 4.426e+06 ... 4.188e+06 4.188e+06
  * x        (x) float64 2.433e+05 2.433e+05 2.434e+05 ... 4.779e+05 4.779e+05
  * band     (band) int64 1 2 3 4 5 6 7
Attributes:
    transform:   (30.0, 0.0, 243285.0, 0.0, -30.0, 4425915.0)
    crs:         +init=epsg:32611
    res:         (30.0, 30.0)
    is_tiled:    0
    nodatavals:  (-9999.0,)

We can use the EPSG code shown above under the crs attribute to create a Cartopy coordinate reference system that we will use later in this notebook:

In [4]:
crs=ccrs.epsg(32611)

Preprocessing

The first step in preprocessing is to mask out missing values. In this case the data self-reports its fill value under the nodatavals attribute, which we can use to set those cells to NaN.

In [5]:
l5_da = l5_da.where(l5_da > l5_da.nodatavals[0])
l8_da = l8_da.where(l8_da > l8_da.nodatavals[0])

We can verify that no -9999 values remain by calculating the minimum value in each DataArray:

In [6]:
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings('ignore', r'All-NaN slice encountered')
    l5_min = l5_da.min().compute()
    l8_min = l8_da.min().compute()
(l5_min, l8_min)
Out[6]:
(<xarray.DataArray ()>
 array(-750.), <xarray.DataArray ()>
 array(-2000.))

NOTE: These operations take a non-trivial amount of time because they require that the data actually be loaded.

Compute NDVI

Now we will calculate the NDVI (Normalized Difference Vegetation Index) for each of these image sets.

In [7]:
NDVI_1988 = (l5_da.sel(band=5) - l5_da.sel(band=4)) / (l5_da.sel(band=5) + l5_da.sel(band=4))
NDVI_1988
Out[7]:
<xarray.DataArray (y: 7241, x: 7961)>
dask.array<shape=(7241, 7961), dtype=float64, chunksize=(256, 256)>
Coordinates:
  * y        (y) float64 4.414e+06 4.414e+06 4.414e+06 ... 4.197e+06 4.197e+06
  * x        (x) float64 2.424e+05 2.424e+05 2.425e+05 ... 4.812e+05 4.812e+05
In [8]:
NDVI_2017 = (l8_da.sel(band=5) - l8_da.sel(band=4)) / (l8_da.sel(band=5) + l8_da.sel(band=4))
NDVI_2017
Out[8]:
<xarray.DataArray (y: 7941, x: 7821)>
dask.array<shape=(7941, 7821), dtype=float64, chunksize=(256, 256)>
Coordinates:
  * y        (y) float64 4.426e+06 4.426e+06 4.426e+06 ... 4.188e+06 4.188e+06
  * x        (x) float64 2.433e+05 2.433e+05 2.434e+05 ... 4.779e+05 4.779e+05

NOTE: To calculate NDVI we selected by band label. In xarray you can also select by positional index.
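The band arithmetic above can be illustrated on a toy array with a labeled band coordinate (the band numbers and pixel values below are made up):

```python
import xarray as xr

# Toy two-band stack with a labeled 'band' coordinate (values are made up)
da = xr.DataArray([[[1.0, 3.0]], [[3.0, 5.0]]],
                  dims=('band', 'y', 'x'), coords={'band': [4, 5]})

# Normalized difference, selecting by band label as in the cells above
ndvi = (da.sel(band=5) - da.sel(band=4)) / (da.sel(band=5) + da.sel(band=4))
print(ndvi.values)  # [[0.5  0.25]]
```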

In [9]:
# Exercise: Try selecting by index like l8_da[0, 0, 0] or l8_da[0:1, 0:100, 0:100]

Combine data from overlapping grids

These two sets of Landsat bands cover roughly the same area but were taken in 1988 and 2017. While they share the same 30 m resolution, they have different numbers of grid cells per image and different x and y offsets. We can see this by printing the first x value for each year and noting that the values differ.

In [10]:
NDVI_1988.x[0].item(), NDVI_2017.x[0].item()
Out[10]:
(242400.0, 243300.0)

Treating year as DataArray name

xarray allows arbitrary data to be combined into one xr.Dataset, though if the underlying grids are not aligned, the shape of the dataset grows to the union of the dimensions of each array. Here we'll combine the data from the two years, using each year as the name of its DataArray:

In [11]:
ds = xr.Dataset({'1988': NDVI_1988, '2017': NDVI_2017})
ds
Out[11]:
<xarray.Dataset>
Dimensions:  (x: 7961, y: 7941)
Coordinates:
  * y        (y) float64 4.188e+06 4.188e+06 4.188e+06 ... 4.426e+06 4.426e+06
  * x        (x) float64 2.424e+05 2.424e+05 2.425e+05 ... 4.812e+05 4.812e+05
Data variables:
    1988     (y, x) float64 dask.array<shape=(7941, 7961), chunksize=(383, 256)>
    2017     (y, x) float64 dask.array<shape=(7941, 7961), chunksize=(5, 286)>
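This union behavior is easy to see on two tiny arrays whose coordinates only partially overlap (a sketch, not the Landsat data):

```python
import math
import xarray as xr

a = xr.DataArray([1.0, 2.0], dims='x', coords={'x': [0, 1]})
b = xr.DataArray([3.0, 4.0], dims='x', coords={'x': [1, 2]})

# Building a Dataset aligns on the union of the coordinates,
# padding the non-overlapping cells with NaN
ds = xr.Dataset({'a': a, 'b': b})
print(ds.sizes['x'])    # 3 -- the union {0, 1, 2}
print(ds['a'].values)   # [ 1.  2. nan]
```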

We can quickly sample one point, such as the center of Walker Lake: latitude 38.6942° N, longitude 118.7081° W. First we convert the lat/lon (PlateCarree) values to the coordinate reference system of the data (using the crs variable defined above):

In [12]:
x_center, y_center = crs.transform_point(-118.7081, 38.6942, ccrs.PlateCarree())

Then we'll select the data point nearest that location, using .compute() to actually load the data with dask:

In [13]:
ds.sel(x=x_center, y=y_center, method='nearest').compute()
Out[13]:
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    y        float64 4.284e+06
    x        float64 3.514e+05
Data variables:
    1988     float64 -0.1643
    2017     float64 -0.5181
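method='nearest' snaps the requested location to the closest existing coordinate, which can be sketched on a one-dimensional toy array (the coordinates here are arbitrary):

```python
import xarray as xr

da = xr.DataArray([10.0, 20.0, 30.0], dims='x', coords={'x': [0.0, 30.0, 60.0]})

# 25.0 is not an exact coordinate; 'nearest' snaps to x=30.0
sel = da.sel(x=25.0, method='nearest')
print(sel.item())    # 20.0
print(sel.x.item())  # 30.0
```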
In [14]:
# Exercise: Use shift-tab completion to explore the different fill methods for sel, and try using a different one ('backfill')

Select region of interest

We'll use the area around the central point as the Region of Interest (ROI): in this case, a 30 km box centered on that point.

In [15]:
buffer = 1.5e4
In [16]:
xmin = x_center - buffer
xmax = x_center + buffer
ymin = y_center - buffer
ymax = y_center + buffer
In [17]:
ROI = ds.sel(x=slice(xmin, xmax), y=slice(ymin, ymax))
ROI
Out[17]:
<xarray.Dataset>
Dimensions:  (x: 1000, y: 1000)
Coordinates:
  * y        (y) float64 4.269e+06 4.269e+06 4.269e+06 ... 4.299e+06 4.299e+06
  * x        (x) float64 3.365e+05 3.365e+05 3.365e+05 ... 3.664e+05 3.664e+05
Data variables:
    1988     (y, x) float64 dask.array<shape=(1000, 1000), chunksize=(225, 192)>
    2017     (y, x) float64 dask.array<shape=(1000, 1000), chunksize=(103, 222)>
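Label-based slicing with slice is inclusive of both endpoints and follows the stored coordinate order, which can be sketched on a toy ascending coordinate (note that the raw Landsat arrays store y in descending order, where the slice bounds would need to be reversed; the combined ds has ascending coordinates after alignment, so slice(ymin, ymax) works here):

```python
import numpy as np
import xarray as xr

x = np.arange(0, 100, 10)
da = xr.DataArray(x * 1.0, dims='x', coords={'x': x})

# Both endpoints of a label-based slice are inclusive
roi = da.sel(x=slice(25, 65))
print(roi.x.values)  # [30 40 50 60]
```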

Since the data share a coordinate system, plots of the two years can be linked. For instance, if you pan one of the plots, the other will pan as well:

In [18]:
args = dict(cmap='viridis', width=500, height=500, crs=crs, rasterize=True)

NDVI_1988_p = ROI['1988'].hvplot(clim=(-3, 1), **args).relabel('1988')
NDVI_2017_p = ROI['2017'].hvplot(clim=(-3, 1), **args).relabel('2017')

(NDVI_1988_p + NDVI_2017_p).options(shared_axes=True)
Out[18]:
In [19]:
# Exercise: Pan around these plots, zoom in and out, reset.

We can also subtract the two ROIs, to visualize the difference between 2017 and 1988:

In [20]:
(ROI['2017'] - ROI['1988']).hvplot(crs=crs, clim=(-2, 2), rasterize=True,
                                   width=600, height=500, cmap='coolwarm').relabel('Difference in NDVI')
Out[20]:

Treating year as coords

Another way to join the data from the two different years is by treating the years as coordinates. This approach is more logically sound, but global attribute data can sometimes be lost. For instance, if the crs differed between the two DataArrays, then that attribute would not appear on the output. To work around this, it is sometimes helpful to assign any attrs that differ to coords. This would look like NDVI_2017.assign_coords(crs=NDVI_2017.attrs['crs']).

In [21]:
NDVI_by_year = xr.concat([NDVI_1988, NDVI_2017], dim=xr.DataArray([1988, 2017], dims=('year'), name='year'))
NDVI_by_year
Out[21]:
<xarray.DataArray (year: 2, y: 7941, x: 7961)>
dask.array<shape=(2, 7941, 7961), dtype=float64, chunksize=(1, 5, 256)>
Coordinates:
  * x        (x) float64 2.424e+05 2.424e+05 2.425e+05 ... 4.812e+05 4.812e+05
  * y        (y) float64 4.188e+06 4.188e+06 4.188e+06 ... 4.426e+06 4.426e+06
  * year     (year) int64 1988 2017
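The same xr.concat pattern can be sketched with two tiny aligned arrays, stacking them along a new labeled year dimension (a toy example, not the Landsat data):

```python
import xarray as xr

a = xr.DataArray([[1.0, 2.0]], dims=('y', 'x'), coords={'y': [0], 'x': [0, 1]})
b = xr.DataArray([[3.0, 4.0]], dims=('y', 'x'), coords={'y': [0], 'x': [0, 1]})

# Concatenate along a new 'year' dimension with explicit labels
by_year = xr.concat([a, b], dim=xr.DataArray([1988, 2017], dims='year', name='year'))
print(by_year.dims)                   # ('year', 'y', 'x')
print(by_year.sel(year=2017).values)  # [[3. 4.]]
```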

We'll do the same sampling at a point and select a ROI to demonstrate how years as coords differ from years as names.

In [22]:
NDVI_by_year.sel(x=x_center, y=y_center, method='nearest').compute()
Out[22]:
<xarray.DataArray (year: 2)>
array([-0.164319, -0.518072])
Coordinates:
    x        float64 3.514e+05
    y        float64 4.284e+06
  * year     (year) int64 1988 2017
In [23]:
ROI = NDVI_by_year.sel(x=slice(xmin, xmax), y=slice(ymin, ymax))

NOTE: It is simpler to define a series of subplots when the variable being iterated over is a coordinate.

In [24]:
p = ROI.hvplot('x','y', col='year', clim=(-3, 1), dynamic=False, **args)
print(p)
:GridSpace   [year]
   :Image   [x,y]   (value)

Since the output is a GridSpace, we can select a subplot from it and alter that subplot in place. For instance, we only really need the colorbar on the second subplot, so we can turn it off for the first one:

In [25]:
p[1988] = p[1988].options(colorbar=False)
In [26]:
p
Out[26]:

Now we can compute the difference:

In [27]:
ROI.diff('year').squeeze().hvplot('x', 'y', crs=crs, cmap='coolwarm', width=600, height=500, rasterize=True)
Out[27]:

Aligning data on different grids

Suppose that the two sets of Landsat images are on a different grid, which often happens because of differences in how data is collected and processed. To illustrate how to handle this case, we'll manually shift the 2017 coordinates slightly:

In [28]:
NDVI_2017['x'] = NDVI_2017.x+15
NDVI_2017['y'] = NDVI_2017.y+15

It is still possible to concatenate the arrays, but since the union of the two offset grids is taken, the x and y coordinates now have nearly twice as many entries as when the arrays share a grid.

In [29]:
ds = xr.concat([NDVI_1988, NDVI_2017],  dim=xr.DataArray([1988, 2017], dims=('year'), name='year'))
ds
Out[29]:
<xarray.DataArray (year: 2, y: 15182, x: 15782)>
dask.array<shape=(2, 15182, 15782), dtype=float64, chunksize=(1, 5, 482)>
Coordinates:
  * x        (x) float64 2.424e+05 2.424e+05 2.425e+05 ... 4.812e+05 4.812e+05
  * y        (y) float64 4.188e+06 4.188e+06 4.188e+06 ... 4.426e+06 4.426e+06
  * year     (year) int64 1988 2017
In [30]:
# Exercise: Use .shape to look at the shape of NDVI_by_year defined above compared to the ds that we just defined

These data can be plotted just as before:

In [31]:
ROI = ds.sel(x=slice(xmin, xmax), y=slice(ymin, ymax))
In [32]:
ROI.hvplot('x', 'y', col='year', **args)
Out[32]:

Since the two years no longer share coordinates, the difference is all NaNs:

In [33]:
ROI.diff('year').isnull().all().compute()
Out[33]:
<xarray.DataArray ()>
array(True)
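The all-NaN difference comes from the outer join: the shifted coordinates no longer match, so each year's values land in grid cells that the other year fills with NaN. A minimal sketch:

```python
import xarray as xr

a = xr.DataArray([1.0, 2.0], dims='x', coords={'x': [0.0, 30.0]})
b = xr.DataArray([3.0, 4.0], dims='x', coords={'x': [15.0, 45.0]})  # shifted by 15

ds = xr.concat([a, b], dim=xr.DataArray([1988, 2017], dims='year', name='year'))
print(ds.sizes['x'])  # 4 -- the union of both grids

# Wherever one year has data, the other has NaN, so the difference is all NaN
print(bool(ds.diff('year').isnull().all()))  # True
```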

To resolve this we can declare a new grid that spans the entire ROI, with the same resolution as the original arrays (30 m):

In [34]:
res = 30
x = np.arange(ROI.x.min(), ROI.x.max(), res)
y = np.arange(ROI.y.min(), ROI.y.max(), res)

We will use linear interpolation to calculate the values for each grid cell. This operation is not yet supported for dask arrays, so we'll first load in the data.

In [35]:
NDVI_1988.load()
NDVI_2017.load()
Out[35]:
<xarray.DataArray (y: 7941, x: 7821)>
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])
Coordinates:
  * y        (y) float64 4.426e+06 4.426e+06 4.426e+06 ... 4.188e+06 4.188e+06
  * x        (x) float64 2.433e+05 2.433e+05 2.434e+05 ... 4.779e+05 4.779e+05

Now we can use the new coordinates defined above to interpolate from the input coordinates to the new grid. The available methods are 'nearest' and 'linear', with 'linear' selected by default.

In [36]:
re_indexed_2017 = NDVI_2017.interp(x=x, y=y)
re_indexed_1988 = NDVI_1988.interp(x=x, y=y)
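Under the hood this is ordinary linear interpolation; the arithmetic for a 1-D case can be sketched with np.interp (DataArray.interp generalizes this to labeled, multi-dimensional grids):

```python
import numpy as np

# Original grid (30 m spacing) and some made-up values
x_old = np.array([0.0, 30.0, 60.0])
vals = np.array([0.0, 10.0, 20.0])

# New grid shifted by half a cell; each point is a blend of its two neighbors
x_new = np.array([15.0, 45.0])
print(np.interp(x_new, x_old, vals))  # [ 5. 15.]
```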
In [37]:
# Exercise: Try using nearest-neighbor interpolation instead of the default linear. Use tab and shift-tab for help.
In [38]:
ROI = xr.concat([re_indexed_1988, re_indexed_2017],  dim=xr.DataArray([1988, 2017], dims=('year'), name='year'))
ROI
Out[38]:
<xarray.DataArray (year: 2, y: 1000, x: 1000)>
array([[[ 0.08708 ,  0.086702, ...,  0.076869,  0.084464],
        [ 0.087274,  0.072566, ...,  0.080221,  0.082222],
        ...,
        [ 0.100243,  0.139397, ..., -0.008975, -0.010067],
        [ 0.156222,  0.179923, ...,  0.01197 , -0.001034]],

       [[ 0.145014,  0.165371, ...,  0.065595,  0.065087],
        [ 0.167434,  0.199714, ...,  0.068627,  0.067301],
        ...,
        [ 0.153879,  0.132281, ...,  0.099584,  0.097955],
        [ 0.139706,  0.123239, ...,  0.098759,  0.097344]]])
Coordinates:
  * x        (x) float64 3.365e+05 3.365e+05 3.365e+05 ... 3.664e+05 3.664e+05
  * y        (y) float64 4.269e+06 4.269e+06 4.269e+06 ... 4.299e+06 4.299e+06
  * year     (year) int64 1988 2017

And now we can plot the difference between the two years:

In [39]:
# Exercise: Plot the difference between the two years using the same approach as we used above.

Next:

Now that the data have been aligned and preprocessed appropriately, we can use them in subsequent steps of a workflow. For instance, you might need to learn more about Resampling, or you might already be ready for Machine Learning.